4. Reference Distributions#
In this notebook we compute the reference distribution for each catchment in the sample, representing the baseline period of record (POR) flow duration curve used for validation. The baseline is estimated by non-parametric Kernel Density Estimation (KDE), a flexible and powerful way of estimating the probability density function (PDF) of a random variable. The KDE is computed using POR observations from all monitoring stations with \(\geq\) five complete years of record, and the resulting distribution is used as the reference against which the distributions estimated by the ensemble methods are compared.
4.1. Kernel Density Estimation#
The “ground truth” FDC is represented by a KDE of the POR observations from all monitoring stations with \(\geq\) five complete years of record. Streamflow distributions span several orders of magnitude, the (unique) values making up a distribution can be very sparse in different regions of the range, and distributions can have different numbers of modes. These characteristics mean that even representations based on several years of observation are only approximations, and even flexible and powerful methods like KDE have difficulty representing some records. We therefore use an adaptive bandwidth based on an assumed “overall” flow measurement uncertainty model. The model is plotted below: it assigns very high uncertainty to the lowest flows that can reasonably be measured and elevated uncertainty at high flows, representing uncertainty typically encountered in hydrological field measurement. Note that it is not an empirically derived error model; rather, it is a conservative approximation of error used to widen the (Gaussian) kernel bandwidth where observations are typically sparse (at the extremes), which coincides with higher uncertainty.
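To make the shape of the assumed model concrete, the relative error at an arbitrary flow can be read off by interpolating between the breakpoints. A minimal sketch (`flow_error` is a hypothetical helper; the breakpoints mirror the error model plotted in the cell below):

```python
import numpy as np

# assumed piecewise-linear error model: relative error vs. flow (m^3/s)
error_points = np.array([1e-4, 1e-3, 1e-2, 1e-1, 1., 1e1, 1e2, 1e3, 1e4, 1e5])
error_values = np.array([1.0, 0.5, 0.2, 0.1, 0.1, 0.1, 0.1, 0.15, 0.2, 0.25])

def flow_error(q):
    """Interpolate relative measurement error at flow q (m^3/s).

    Interpolation is linear in log-flow so each decade is treated evenly.
    """
    return np.interp(np.log10(q), np.log10(error_points), error_values)

print(flow_error(1e-4))  # 1.0 (100% error at the lowest measurable flows)
print(flow_error(50.0))  # 0.1 (10% error in the mid-range)
```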
4.1.1. Adaptive Bandwidth Kernel#
A simulated probability density function is estimated from observations at the \(k\) nearest-neighbour stations. First, \(k\) simulated series are generated by equal unit area runoff, \(\hat p = \hat P \cdot w\), where \(\hat P = [\hat p_1, \hat p_2, \cdots, \hat p_k]\) and each \(\hat p_i = f(X_i(t))\).
In both cases, the function \(f \rightarrow \hat p(x)\) represents kernel density estimation, which defines the probability density as

\[\hat p(x) = \frac{1}{n} \sum_{i=1}^{n}\frac{1}{h(x)}K\left( \frac{x-x_i}{h(x)}\right)\]
where \(h(x)\) is an adaptive kernel bandwidth that addresses vestiges of precision in the observed data, reflecting the nature of streamflow as a continuous variable, and additionally incorporates a piecewise-linear model of overall measurement uncertainty.
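A minimal sketch of this “balloon” form of adaptive KDE, where the bandwidth varies with the evaluation point \(x\) rather than with each sample. The bandwidth profile `h` below is illustrative, not the error-model bandwidth used by the notebook's `KDEEstimator`:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def adaptive_kde(x_grid, samples, h):
    """Balloon-style adaptive KDE: one bandwidth per evaluation point.

    Implements p(x) = (1/n) * sum_i (1/h(x)) * K((x - x_i) / h(x)).
    """
    h = np.asarray(h)
    # (n_grid, n_samples) matrix of scaled distances from grid points to samples
    u = (x_grid[:, None] - np.asarray(samples)[None, :]) / h[:, None]
    return gaussian_kernel(u).mean(axis=1) / h

# usage: wider bandwidth toward the extremes of the grid
rng = np.random.default_rng(42)
samples = rng.normal(0.0, 1.0, size=500)
x_grid = np.linspace(-4.0, 4.0, 201)
h = 0.2 + 0.3 * np.abs(x_grid) / 4.0  # illustrative bandwidth profile
pdf = adaptive_kde(x_grid, samples, h)
# note: balloon estimators are not exactly normalized, so the Riemann sum
# over the grid is only approximately 1
integral = np.sum(pdf) * (x_grid[1] - x_grid[0])
```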
4.1.2. Sparse support#
Through exploration of many example PDF approximations by KDE, a second vestigial problem became evident: sparse support. Despite requiring a minimum of 5 years of daily records, some observation sets have extremes that destabilize the resulting PDF if the KDE uses a “rule of thumb”, i.e. Silverman (constant), bandwidth approximation. In some cases the bandwidth set by the error model smooths the distribution around regions of sparse support (high precision relative to the density of observed values); however, we opted to add a second bandwidth criterion based on the spacing between unique values in the dataset. The final bandwidth is then the greater of the bandwidth set by the assumed error and that set by the precision.
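The precision criterion can be sketched as a bandwidth floor derived from the spacing of unique values in log space. Here `precision_bandwidth` is a hypothetical helper, and `h_error` stands in for the bandwidth implied by the error model described above:

```python
import numpy as np

def precision_bandwidth(log_values):
    """Bandwidth floor from data precision: the typical gap between
    adjacent unique (log-transformed) values."""
    uniq = np.unique(log_values)
    if uniq.size < 2:
        return 0.0
    return float(np.median(np.diff(uniq)))

# hypothetical example: readings rounded to 0.01 m^3/s leave visible
# gaps between unique values once log-transformed
rng = np.random.default_rng(0)
q = np.round(rng.lognormal(0.0, 1.0, 2000), 2)
q = q[q > 0]
h_precision = precision_bandwidth(np.log(q))
h_error = 0.1  # stand-in for the error-model bandwidth at this flow
h = max(h_error, h_precision)  # final bandwidth: the greater of the two criteria
```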
from multiprocessing import Pool
from functools import partial
import os
import pandas as pd
import json
import numpy as np
import geopandas as gpd
import xarray as xr
from scipy.stats import laplace
from pathlib import Path
from bokeh.plotting import figure, show, gridplot
from bokeh.io import output_notebook
output_notebook()
from utils.kde_estimator import KDEEstimator
import utils.data_processing_functions as dpf
BASE_DIR = os.getcwd()
# update this to the path where you stored `HYSETS_2023_update_QC_stations.nc`
HYSETS_DIR = Path('/home/danbot/code/common_data/HYSETS')
# plot the measurement error:
efig = figure(title="Estimated Measurement Error Model", width=600, height=400, x_axis_type='log')
error_points = np.array([1e-4, 1e-3, 1e-2, 1e-1, 1., 1e1, 1e2, 1e3, 1e4, 1e5]) # Reference flow points in m^3/s
error_values = np.array([1.0, 0.5, 0.2, 0.1, 0.1, 0.1, 0.1, 0.15, 0.2, 0.25])
efig.line(error_points, error_values, line_color='red', line_width=2, legend_label='Measurement Error Model')
efig.xaxis.axis_label = r'$$\text{Flow } m^3/s$$'
efig.yaxis.axis_label = r'$$\text{Error } [\text{x}100\%]$$'
efig.legend.background_fill_alpha = 0.5
efig = dpf.format_fig_fonts(efig, font_size=14)
layout = gridplot([efig], ncols=2, width=500, height=350)
show(layout)
4.2. Data pre-processing overview#
Note that these steps are optional and the end results of these pre-processing steps are provided in the open data repository.
4.2.1. Get updated data sources and validate catchment attributes#
- Extract catchment attributes using updated catchment geometries where available (optional; updated catchment geometries are saved in data/BCUB_watershed_bounds_updated.geojson).
- Process climate indices for HYSETS catchments in the study region (optional; pre-processed attributes are contained in BCUB_watershed_attributes_updated.csv).
4.3. Import HYSETS catchment attributes#
# import the HYSETS attributes data
hysets_df = pd.read_csv(HYSETS_DIR / 'HYSETS_watershed_properties.txt', sep=';')
ws_id_dict = hysets_df.set_index('Official_ID')['Watershed_ID'].to_dict()
da_dict = hysets_df.set_index('Official_ID')['Drainage_Area_km2'].to_dict()
official_id_dict = {row['Official_ID']: row['Watershed_ID'] for _, row in hysets_df.iterrows()}
4.3.1. Import the pre-filtered stations within the study region#
# import the catchment attributes (with runoff statistics) for stations in the BCUB (study) region
bcub_df = pd.read_csv(os.path.join('data', 'catchment_attributes_with_runoff_stats.csv'), dtype={'official_id': str})
bcub_df['official_id'] = bcub_df['official_id'].astype(str)
# create a dict to map HYSETS watershed IDs to the official station IDs
bcub_df['watershedID'] = bcub_df['official_id'].apply(lambda x: official_id_dict.get(x, None))
print(f' Found {len(bcub_df)} catchments in the BCUB region with runoff statistics.')
Found 1098 catchments in the BCUB region with runoff statistics.
4.4. Import streamflow timeseries#
Note
At the top of data_processing_functions.py, update the STREAMFLOW_DIR variable to match where the HYSETS streamflow time series are stored.
# Load dataset
streamflow = xr.open_dataset(HYSETS_DIR / 'HYSETS_2023_update_QC_stations.nc')
# Promote 'watershedID' to a coordinate on 'watershed'
streamflow = streamflow.assign_coords(watershedID=("watershed", streamflow["watershedID"].data))
# Set 'watershedID' as index
streamflow = streamflow.set_index(watershed="watershedID")
# Select only watershedIDs present in bcub_df
valid_ids = [int(wid) for wid in bcub_df['watershedID'].values if wid in streamflow.watershed.values]
ds = streamflow.sel(watershed=valid_ids)
USGS station IDs are stored in the dataset with the unfortunate characteristic that different stations can have identifiers that are substrings of each other, so we add a few extra lines of code to ensure we match the correct record exactly.
# Confirm all watershed IDs exist in ds
bcub_ws_ids = bcub_df['watershedID'].values
ds_ids = ds.watershed.values # After selection via sel(watershed=valid_ids)
missing = [wid for wid in bcub_ws_ids if wid not in ds_ids]
n_total = len(bcub_ws_ids)
n_missing = len(missing)
assert n_missing == 0, f"{n_missing} / {n_total} watershedIDs missing from dataset. First few missing: {missing[:5]}"
4.5. Calculate Period of Record (POR) probability distribution functions for each station record#
In order to compare distributions fairly without data leakage in the subsequent prediction steps, we first define a “global evaluation grid”: the support over which all PDFs will be evaluated. We do this by setting a global expected range of flow on a unit-area basis. Since we clipped the lowest flows to \(10^{-4}\) \(m^3/s\) at import, and the minimum drainage area in the dataset is 0.7 \(\text{km}^2\), we can find the minimum and maximum unit area runoff in the dataset and leave some margin on either side.
Since we will be computing KL divergences, we also assume a prior distribution based on the heavy-tailed Laplace distribution, using the mean and standard deviation of (log) unit area runoff predicted (out of sample) from catchment attributes. This prior is applied to the donor/proxy PMF to avoid division by zero where the support probability becomes very small.
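One way such a prior might be assembled on the log evaluation grid. The `laplace_prior_pmf` helper and the example location/scale values are illustrative; the actual predicted parameters are loaded in the next cell:

```python
import numpy as np
from scipy.stats import laplace

def laplace_prior_pmf(log_grid, log_dx, mean_log_uar, sd_log_uar):
    """Laplace prior over the log-UAR evaluation grid, expressed as a PMF.

    scipy's `scale` is the Laplace diversity b; for a target standard
    deviation sd, b = sd / sqrt(2).
    """
    b = sd_log_uar / np.sqrt(2)
    pdf = laplace.pdf(log_grid, loc=mean_log_uar, scale=b)
    pmf = pdf * log_dx
    return pmf / pmf.sum()  # renormalize on the truncated grid

# usage on a grid like the one defined later in the notebook
# (parameter values here are illustrative)
log_grid = np.linspace(np.log(1e-6), np.log(1e4), 2**12)
log_dx = np.gradient(log_grid)
prior = laplace_prior_pmf(log_grid, log_dx, mean_log_uar=2.5, sd_log_uar=1.2)
```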
# import the predicted mean and standard deviation from the previous notebook (Predict Runoff Statistics)
predicted_params = pd.read_csv('data/results/parameter_prediction_results/mean_parameter_predictions.csv', index_col=0)
param_dicts = {
'mean': predicted_params['log_uar_mean_mean_predicted'].to_dict(),
'sd': predicted_params['log_uar_std_mean_predicted'].to_dict(),
'median': predicted_params['log_uar_median_mean_predicted'].to_dict(),
'mad': predicted_params['log_uar_mad_mean_predicted'].to_dict(),
}
4.6. Define a global range of expected (unit area) runoff#
def retrieve_timeseries_discharge(stn):
    watershed_id = official_id_dict[stn]
    da = da_dict[stn]
    try:
        df = ds['discharge'].sel(watershed=str(watershed_id)).to_dataframe(name='discharge').reset_index()
    except KeyError:
        print(f"Warning: Station {stn} not found in dataset under watershedID {watershed_id}.")
        return pd.DataFrame()
    df = df.set_index('time')[['discharge']]
    df.dropna(inplace=True)
    df['zero_flow_flag'] = df['discharge'] == 0
    # unit area runoff in L/s/km^2 and daily runoff depth in mm/day
    df[f'{stn}_uar'] = 1000 * df['discharge'] / da
    df[f'{stn}_mm'] = df['discharge'] * (24 * 3.6 / da)
    # clip zero flows to 1e-4 m^3/s before taking logs
    df['replaced_zero_flow_uar'] = df['discharge'].clip(1e-4) * (1000 / da)
    df['log_uar'] = np.log(df['replaced_zero_flow_uar'])
    return df
def process_station(stn, da_dict):
    df = retrieve_timeseries_discharge(stn)
    uar = df['replaced_zero_flow_uar'].dropna() if not df.empty else pd.Series(dtype=float)
    return stn, uar.min(), uar.max()

def parallel_min_max(bcub_stations, da_dict, processes=None):
    processes = processes or 18
    with Pool(processes=processes) as pool:
        fn = partial(process_station, da_dict=da_dict)
        results = pool.map(fn, bcub_stations)
    global_min = float('inf')
    global_max = float('-inf')
    # drop stations that returned no data (min/max are NaN)
    results = [res for res in results if pd.notna(res[1])]
    for stn, min_uar, max_uar in results:
        if max_uar > global_max:
            global_max = max_uar
            print(f'Max streamflow for {stn}: {max_uar:.2f} L/s/km²')
        if min_uar < global_min:
            global_min = min_uar
            print(f' Min streamflow for {stn}: {min_uar:.2e} L/s/km² (DA={da_dict[stn]:.2f} km²)')
    print(f"\n Global min = {global_min:.2e} L/s/km², max = {global_max:.2f} L/s/km²")
# Usage
compute_range = True
if compute_range:
    parallel_min_max(bcub_df['official_id'], da_dict)
def set_uar_grid(global_min, global_max, n_grid_points=2**12):
    baseline_log_grid = np.linspace(global_min, global_max, n_grid_points)
    baseline_lin_grid = np.exp(baseline_log_grid)
    log_dx = np.gradient(baseline_log_grid)
    max_step_size = baseline_lin_grid[-1] - baseline_lin_grid[-2]
    print(f' max step size = {max_step_size:.1f} L/s/km^2 for n={n_grid_points:.1e} grid points')
    min_step_size = baseline_lin_grid[1] - baseline_lin_grid[0]
    print(f' min step size = {min_step_size:.2e} L/s/km^2 for n={n_grid_points:.1e} grid points')
    return baseline_lin_grid, baseline_log_grid, log_dx
# set global bounding values of (log) UAR with margin beyond the observed range
# (the maximum discharge in the dataset is roughly 19400 m^3/s)
baseline_lin_grid, baseline_log_grid, log_dx = set_uar_grid(np.log(1e-6), np.log(1e4), n_grid_points=2**12)
base_kde_estimator = KDEEstimator(baseline_log_grid, log_dx)
max step size = 56.1 L/s/km^2 for n=4.1e+03 grid points
min step size = 5.64e-09 L/s/km^2 for n=4.1e+03 grid points
# set the minimum record length
min_record_years = 5
# choose if you want to set a global prior distribution based on the
# Laplace fit to (out of sample) predicted location and scale
set_global_prior = False
validated_stations = bcub_df[bcub_df['n_years'].astype(float) >= min_record_years]['official_id'].values
print(f'Filtering stations with at least {min_record_years} complete years of data: {len(validated_stations)}/{len(bcub_df)} stations.')
# there's a problem finding '0212414900' in the data, so drop it
validated_stations = [stn for stn in validated_stations if stn != '0212414900']
# load the complete year dictionary
with open('data/complete_years.json', 'r') as f:
complete_year_dict = json.load(f)
Filtering stations with at least 5 complete years of data: 1098/1098 stations.
from scipy.stats import wasserstein_distance
class ReferenceDistribution:
    def __init__(self, **kwargs):
        for k, v in kwargs.items():
            setattr(self, k, v)

    def _initialize_station(self, stn):
        self.stn = stn
        self.df = retrieve_timeseries_discharge(stn)
        self.da = self.da_dict[stn]
        self.n_observations = len(self.df.dropna())

def compute_baseline_distribution_by_kde(inputs):
    stn, input_config = inputs
    baseline_distribution = ReferenceDistribution(**input_config)
    baseline_distribution._initialize_station(stn)
    kde_obj = input_config['kde_obj']
    data = baseline_distribution.df['replaced_zero_flow_uar'].dropna().values
    da = input_config['da_dict'][stn]
    pmf, pdf = kde_obj.compute(data, da)
    return (stn, pmf, pdf)
pdf_path = Path(os.getcwd()) / 'data' / 'results' / 'baseline_distributions' / 'bcub_pdfs.csv'
pmf_path = Path(os.getcwd()) / 'data' / 'results' / 'baseline_distributions' / 'bcub_pmfs.csv'
shared_config = {
    'da_dict': da_dict,
    'complete_year_dict': complete_year_dict,
    'kde_obj': base_kde_estimator,
}
# build the per-station inputs up front (also used in the comparison below)
inputs = [(stn, shared_config) for stn in validated_stations]
if os.path.exists(pdf_path):
    pmf_df = pd.read_csv(pmf_path, index_col='q')
    pdf_df = pd.read_csv(pdf_path, index_col='q')
else:
    # compute the PDF and PMF for each station
    print(len(validated_stations), 'stations meeting minimum complete period of record.')
    # with Pool() as pool:
    #     results = pool.map(compute_baseline_distribution_by_kde, inputs)
    results = []
    for ip in inputs:
        result = compute_baseline_distribution_by_kde(ip)
        results.append(result)
        if len(results) % 100 == 0:
            print(f'Processed {len(results)}/{len(inputs)} stations...')
    # concatenate the results
    stations, pmfs, pdfs = zip(*results)
    pdf_df = pd.DataFrame(np.stack(pdfs, axis=1), columns=stations)
    pmf_df = pd.DataFrame(np.stack(pmfs, axis=1), columns=stations)
    pdf_df['q'] = baseline_lin_grid
    pmf_df['q'] = baseline_lin_grid
    # save the pdf and pmf files
    pdf_df.set_index('q', inplace=True)
    pmf_df.set_index('q', inplace=True)
    pdf_df.to_csv(pdf_path)
    pmf_df.to_csv(pmf_path)
4.6.1. Compare the Adaptive KDE PMFs vs. the FFTKDE PMF#
from utils.evaluation_metrics import EvaluationMetrics
from KDEpy import FFTKDE

eval_obj = EvaluationMetrics(baseline_log_grid, log_dx)
all_results = {}
for stn, input_config in inputs:
    akde_pmf = pmf_df[stn].values
    # assert the PMF sums to 1
    assert np.isclose(np.sum(akde_pmf), 1, atol=1e-4), f'PMF for {stn} does not sum to 1: {np.sum(akde_pmf):.4f}'
    baseline_distribution = ReferenceDistribution(**input_config)
    baseline_distribution._initialize_station(stn)
    kde_obj = input_config['kde_obj']
    data = baseline_distribution.df['replaced_zero_flow_uar'].dropna().values
    try:
        fft_kde_pdf = FFTKDE(bw="silverman").fit(np.log(data)).evaluate(baseline_log_grid)
    except ValueError as e:
        # catch failures from FFTKDE and report the station ID
        print(f'Warning: FFTKDE failed for station {stn} with error: {e}')
        continue
    # convert the FFTKDE PDF to a PMF
    fft_pmf = fft_kde_pdf * log_dx
    fft_pmf /= np.sum(fft_pmf)
    assert np.isclose(np.sum(fft_pmf), 1, atol=1e-4), f'PMF for {stn} does not sum to 1: {np.sum(fft_pmf):.4f}'
    # compute measures between the adaptive KDE and the FFTKDE
    measures = eval_obj._evaluate_fdc_metrics_from_pmf(akde_pmf, fft_pmf)
    all_results[stn] = measures
/home/danbot/code/data_analysis/lib/python3.12/site-packages/KDEpy/bw_selection.py:285: UserWarning: Silverman's rule failed. Too many idential values. Setting bw = 0.2653109141748294
warnings.warn(
/home/danbot/code/data_analysis/lib/python3.12/site-packages/KDEpy/bw_selection.py:285: UserWarning: Silverman's rule failed. Too many idential values. Setting bw = 0.3589583777837614
warnings.warn(
# format to a DataFrame with the station ID as index
results_df = pd.DataFrame(all_results).T
results_df.index.name = 'station_id'
results_df.reset_index(inplace=True)
results_df.to_csv('data/results/kde_fft_comparison.csv', index=False)
results_df.sort_values(by=['kld'], inplace=True, ascending=False)
bad_rmse_stns = results_df['station_id'].values[:10]
results_df.head(10)
|     | station_id | rmse      | relative_error | ve        | nse      | kge      | kld      | emd     |
|-----|------------|-----------|----------------|-----------|----------|----------|----------|---------|
| 9   | 05AB022    | 0.089766  | 0.321285       | 0.017087  | 0.998129 | 0.966668 | 0.478741 | 0.0991  |
| 921 | 12345000   | 44.001743 | 0.286702       | -0.289957 | 0.659076 | 0.412635 | 0.398419 | 19.8344 |
| 935 | 12353820   | 18.310978 | 0.297060       | -0.333375 | 0.644401 | 0.378055 | 0.394362 | 6.8054  |
| 503 | 08NL037    | 0.518950  | 0.210084       | -0.052027 | 0.991399 | 0.902767 | 0.223512 | 0.1612  |
| 420 | 08NA056    | 0.056464  | 0.124036       | 0.008129  | 0.998570 | 0.975052 | 0.170880 | 0.1801  |
| 417 | 08NA052    | 1.238116  | 0.031092       | -0.015836 | 0.995317 | 0.955860 | 0.098915 | 0.5796  |
| 738 | 12111500   | 11.219424 | 0.107840       | -0.135755 | 0.879694 | 0.679531 | 0.097771 | 4.9752  |
| 418 | 08NA053    | 3.206015  | 0.046618       | -0.027818 | 0.992550 | 0.935026 | 0.097306 | 1.3586  |
| 471 | 08NH100    | 7.452565  | 0.041910       | -0.038559 | 0.983167 | 0.902817 | 0.072722 | 2.9773  |
| 953 | 12367500   | 0.347037  | 0.041836       | -0.025561 | 0.987886 | 0.938282 | 0.067220 | 0.4951  |
4.6.2. Plot an example FDC#
stn = '08MH147'
stave_dist = ReferenceDistribution(**shared_config)
stave_dist._initialize_station(stn)
kde_obj = shared_config['kde_obj']
# take an example slice
df = stave_dist.df[stave_dist.df.index > '2020-01-01'].copy()
qlabel = 'replaced_zero_flow_uar'
# density
data = df[qlabel].dropna().values
fft_kde_pdf = FFTKDE(bw="silverman").fit(np.log(data)).evaluate(baseline_log_grid)
# convert the FFTKDE pdf to a PMF
fft_pmf = fft_kde_pdf * log_dx
fft_pmf /= np.sum(fft_pmf)
from bokeh.models import Label

# sort flows in descending order for the FDC and for label placement
sorted_vals = np.sort(df[qlabel].dropna().values)[::-1]
label1 = r"$$x_1 \leq x_2 \leq \cdots \leq x_N,$$"
label2 = r"$$x = F(q_t) = \frac{1}{N}, \dots, \frac{n}{N}$$"
fdc_label1 = Label(
    x=sorted_vals[int(len(sorted_vals) * 0.8)],  # place at ~80th percentile of flow
    y=2400,
    text=label1,
    text_font_size="10pt",
    text_color="black",
)
fdc_label2 = Label(
    x=sorted_vals[int(len(sorted_vals) * 0.8)],  # place at ~80th percentile of flow
    y=1900,
    text=label2,
    text_font_size="10pt",
    text_color="black",
)
from bokeh.layouts import row
# Plot A: Time Series
p1 = figure(title="", x_axis_type="datetime", width=350, height=300, toolbar_location=None)
p1.line(df.index, df[qlabel], line_width=2)
p1.yaxis.axis_label = r"$$\text{UAR } [L/s/\text{km}^2]$$"
p1.xaxis.major_label_orientation = np.pi / 4 # rotate x-axis labels
# Plot B: PDF
p2 = figure(title='', width=330, height=300, x_axis_type='log', toolbar_location=None)
p2.line(baseline_lin_grid[2500:], fft_kde_pdf[2500:], line_width=2, legend_label='PDF')
p2.yaxis.axis_label = r"$$\text{Density}$$"
p2.xaxis.axis_label = r"$$\text{UAR } [L/s/\text{km}^2]$$"
# Plot C: Duration Curve (Exceedance)
sorted_vals = np.sort(df[qlabel].dropna().values)[::-1]
exceed_prob = np.linspace(0, 1, len(sorted_vals), endpoint=False)
p3 = figure(title="", width=340, height=300, toolbar_location=None)
p3.line(exceed_prob * 100, sorted_vals, line_width=2)
p3.xaxis.axis_label = r"$$\text{Exceedance Probability} (\%)$$"
p3.yaxis.axis_label = r"$$\text{UAR } [L/s/\text{km}^2]$$"
# p3.yaxis.visible=False
# p3.add_layout(fdc_label1)
# p3.add_layout(fdc_label2)
p1 = dpf.format_fig_fonts(p1, font_size=14)
p3 = dpf.format_fig_fonts(p3, font_size=14)
lt = gridplot([p1, p3, p2], ncols=3, width=350, height=300)
# lt = row([p1, p3, p2])#, ncols=3, width=350, height=300)
show(lt)
4.6.3. Compute the entropy of each PMF#
Note
The remainder of this notebook is not part of the analysis presented in the accompanying paper.
The entropy of the distribution represents the randomness or disorder of the system, and it is given by the negative sum of probability-weighted log probabilities of the system states:

\[H(X) = -\sum_{i} p_i \log_2 p_i\]
# compute the entropy of the prior adjusted distribution for each station
bit_depths = list(range(2, 13))  # set a range that is both too low and too high for the data
entropy_output_folder = Path(os.getcwd()) / 'data' / 'results' / 'entropy_results'
entropy_output_folder.mkdir(parents=True, exist_ok=True)
states = [2**b for b in bit_depths]
for s in states:
    q_values = pmf_df.index.values
    # resample each PMF onto s log-spaced states
    pmf_resampled = pd.DataFrame(
        index=np.exp(np.linspace(np.log(q_values.min()),
                                 np.log(q_values.max()),
                                 s)),
        columns=pmf_df.columns,
        dtype=float
    )
    for stn in pmf_df.columns:
        pmf = pmf_df[stn].values
        # assign each fine-grid point to a coarse bin and sum the mass
        bin_numbers = np.digitize(q_values, pmf_resampled.index.values)
        tmp = pd.DataFrame({'pmf': pmf, 'bin': bin_numbers})
        agg = tmp.groupby('bin')['pmf'].sum()
        full_pmf = agg.reindex(range(1, s + 1), fill_value=0)
        pmf_resampled[stn] = full_pmf.values
        assert np.isclose(full_pmf.sum(), 1), f'PMF for {stn} does not sum to 1: {np.sum(full_pmf):.6f}'
    n_bits = int(np.log2(s))
    pmf_resampled.index.name = 'q'
    pmf_resampled.to_csv(entropy_output_folder / f'bcub_pmf_resampled_{n_bits}bits.csv')
from bokeh.palettes import RdYlGn

def logspace_edges_from_linear_centers(centers):
    """Given linear-space bin centers from a log-uniform grid, return bin edges in linear space."""
    log_centers = np.log(centers)
    dlog = log_centers[1] - log_centers[0]
    log_edges = np.concatenate([
        [log_centers[0] - dlog / 2],
        log_centers + dlog / 2
    ])
    return np.exp(log_edges)  # return edges in linear space

pdf_fig = figure(title="BCUB PDF Resampled", width=750, height=400, x_axis_type='log')
entropy_distributions = {}
for s in states:
    bits = int(np.log2(s))
    pmf_path = entropy_output_folder / f'bcub_pmf_resampled_{bits}bits.csv'
    pmf_resampled = pd.read_csv(pmf_path, index_col=0)
    pmf = np.percentile(pmf_resampled, 50, axis=1)  # median PMF across stations
    # compute bin edges and widths from the linear bin centers
    centers = pmf_resampled.index.astype(float).values
    edges = logspace_edges_from_linear_centers(centers)
    dx = np.diff(edges)
    # convert PMF to PDF and normalize
    pdf = pmf / dx
    x_centers = (edges[:-1] + edges[1:]) / 2
    pdf /= np.trapezoid(pdf, x=x_centers)  # ensure integral = 1
    # entropy of the PMF (still valid for information content)
    mask = pmf > 0
    entropy = -np.sum(pmf[mask] * np.log2(pmf[mask]))
    ratio = entropy / bits
    # plot the PDF
    pdf_fig.quad(
        top=pdf, bottom=0,
        left=edges[:-1], right=edges[1:],
        fill_color=RdYlGn[11][states.index(s) % len(RdYlGn[11])],
        line_color=None, alpha=0.7,
        legend_label=f'{bits:.0f}b (H={entropy:.1f}, {100*ratio:.0f}%)'
    )
# Final formatting
pdf_fig.legend.click_policy = 'hide'
pdf_fig.legend.location = 'top_right'
pdf_fig.xaxis.axis_label = "Unit Area Runoff (L/s/km²)"
pdf_fig.yaxis.axis_label = "Probability Density"
pdf_fig = dpf.format_fig_fonts(pdf_fig, font_size=14)
show(pdf_fig)
def compute_empirical_cdf(values):
    """Compute the empirical cumulative distribution function (CDF) of the given values."""
    sorted_values = np.sort(values)
    cdf = np.arange(1, len(sorted_values) + 1) / len(sorted_values)
    return sorted_values, cdf
from bokeh.palettes import Viridis10

# plot the CDFs of entropy by number of bits
entropy_dist_fig = figure(title="Entropy Distributions by Number of Bits", width=750, height=400)
colours = Viridis10
for s in states:
    bits = int(np.log2(s))
    pmf_resampled = pd.read_csv(entropy_output_folder / f'bcub_pmf_resampled_{bits}bits.csv', index_col=0)
    # broadcast the entropy computation across all station columns
    pmf = pmf_resampled.values
    entropy_by_column = np.where(pmf > 0, pmf * np.log2(pmf), 0.0)
    sample_entropy = -np.sum(entropy_by_column, axis=0)
    mean_entropy = np.mean(sample_entropy)
    x, cdf = compute_empirical_cdf(sample_entropy)
    entropy_dist_fig.line(x, cdf, line_width=3, color=colours[states.index(s) % len(colours)],
                          legend_label=f'{bits:.0f}b (H={mean_entropy:.1f})')
entropy_dist_fig.legend.location = 'bottom_right'
entropy_dist_fig.legend.background_fill_alpha = 0.5
entropy_dist_fig.xaxis.axis_label = "Entropy (bits)"
entropy_dist_fig.yaxis.axis_label = "Cumulative Probability"
entropy_dist_fig.legend.click_policy = 'hide'
entropy_dist_fig = dpf.format_fig_fonts(entropy_dist_fig, font_size=14)
show(entropy_dist_fig)
from bokeh.palettes import Viridis10 as colours

pmf_fig = figure(title="PDF by Percentile", width=1000, height=400, x_axis_type='log')
percentiles = 2, 10, 25, 50, 75, 90, 95, 96, 98, 99
pmf_resampled = pd.read_csv(entropy_output_folder / 'bcub_pmf_resampled_8bits.csv', index_col=0)
linear_centers = pmf_resampled.index.astype(float).values
edges = logspace_edges_from_linear_centers(linear_centers)
dx = np.diff(edges)  # linear bin widths corresponding to log-space bins
n_bits = 8
for pct in percentiles:
    pmf = np.percentile(pmf_resampled, pct, axis=1)
    pdf = pmf / dx  # convert PMF to PDF
    # normalize the PDF to unit area
    area = np.trapezoid(pdf, x=edges[:-1])
    pdf = pdf / area
    area = np.trapezoid(pdf, x=edges[:-1])
    assert np.isclose(area, 1), f'Area under PDF for {pct}th percentile does not equal 1: {area:.6f}'
    # entropy of the (renormalized) percentile PMF
    pmf_norm = pmf / pmf.sum()
    mask = pmf_norm > 0
    entropy = -np.sum(pmf_norm[mask] * np.log2(pmf_norm[mask]))
    ratio = entropy / n_bits
    pmf_fig.line(
        x=edges[:-1],
        y=pdf,
        line_width=4,
        color=colours[percentiles.index(pct) % len(colours)],
        legend_label=f'{pct}th Percentile (H={entropy:.1f} {100*ratio:.0f}%)'
    )
pmf_fig.legend.click_policy = 'hide'
pmf_fig.legend.location = 'top_right'
# pmf_fig.xaxis.axis_label = r'$$\text{Unit Area Runoff } \frac{L}{s\cdot \text{km}^2}$$'
# pmf_fig.yaxis.axis_label = r'$$\text{PDF } P(X\leq x)$$'
pmf_fig.xaxis.axis_label = "Unit Area Runoff (L/s/km²)"
pmf_fig.yaxis.axis_label = "Probability Density"
pmf_fig.legend.background_fill_alpha = 0.5
pmf_fig.add_layout(pmf_fig.legend[0], 'right')
pmf_fig = dpf.format_fig_fonts(pmf_fig, font_size=14)
show(pmf_fig)
from bokeh.models import LogColorMapper, ColorBar, FixedTicker, ColumnDataSource, LinearColorMapper
from bokeh.palettes import Viridis256
percentiles = np.linspace(0.01, 99.9, 500)
exceedance_probs = 1 - (percentiles / 100) # Convert percentiles to exceedance probabilities
pmf_resampled = pd.read_csv(entropy_output_folder / 'bcub_pmf_resampled_11bits.csv', index_col=0)
assert np.allclose(pmf_resampled.sum(), 1), "PMF does not sum to 1 across all stations."
linear_centers = pmf_resampled.index.astype(float).values
edges = logspace_edges_from_linear_centers(linear_centers)
dx = np.diff(edges)
x_vals = edges[:-1]
log_x = np.log10(x_vals)
z_matrix = []
for pct in percentiles:
    pmf = np.percentile(pmf_resampled, pct, axis=1)
    z_matrix.append(pmf)
z_image = np.flipud(np.array(z_matrix))
# Use log color mapping for dynamic range of PDF values
mapper = LogColorMapper(palette=Viridis256, low=z_image[z_image > 0].min(), high=z_image.max())
# Create figure with x-axis in log space (via manual transformation)
p = figure(
    title="PMF across Large Sample of Watersheds",
    x_range=(x_vals.min(), x_vals.max()),
    y_range=(0.01, 99.9),
    width=1000,
    height=500,
    x_axis_type="log",
)
p.image(
    image=[z_image],
    x=x_vals.min(),
    y=1,
    dw=x_vals.max() - x_vals.min(),
    dh=98,
    color_mapper=mapper
)
# Axis labels
p.xaxis.axis_label = "Unit Area Runoff (L/s/km²)"
p.yaxis.axis_label = "Percentile"
# Color bar
color_bar = ColorBar(title="Probability Mass", color_mapper=mapper, label_standoff=12)
p.add_layout(color_bar, 'right')
show(p)